1 



Formation of regulatory modules by local sequence duplication 

Armita Nourmohammad 1 , Michael Lassig 1 

1 Institute for Theoretical Physics, University of Cologne, Ziilpicherstr. 77, 50937 Koln, 
Germany 

Abstract 

Turnover of regulatory sequence and function is an important part of molecular evolution. But what are 
the modes of sequence evolution leading to rapid formation and loss of regulatory sites? Here, we show 
that a large fraction of neighboring transcription factor binding sites in the fly genome have formed from 
a common sequence origin by local duplications. This mode of evolution is found to produce regulatory 
information: duplications can seed new sites in the neighborhood of existing sites. Duplicate seeds evolve 
subsequently by point mutations, often towards binding a different factor than their ancestral neighbor 
sites. These results are based on a statistical analysis of 346 cis-regulatory modules in the Drosophila 
melanogaster genome, and a comparison set of intergenic regulatory sequence in Saccharomyces cerevisiae. 
In fly regulatory modules, pairs of binding sites show significantly enhanced sequence similarity up to 
distances of about 50 bp. We analyze these data in terms of an evolutionary model with two distinct 
modes of site formation: (i) evolution from independent sequence origin and (ii) divergent evolution 
following duplication of a common ancestor sequence. Our results suggest that pervasive formation of 
binding sites by local sequence duplications distinguishes the complex regulatory architecture of higher 
eukaryotes from the simpler architecture of unicellular organisms. 

Introduction 

The importance of regulatory variations as a driving force for phcnotypic evolution has been suggested 
some time ago [1,2]. However, a quantitative understanding of gene regulation has become possible 
only after the advent of large-scale genomic sequence and regulatory interaction data. Important build- 
ing blocks are genome- wide maps of protein-DNA binding, statistical inference methods [3,4], high- 
throughput measurements of sequence-specific binding affinities of transcription factors [5-8] , and cross- 
species comparisons of regulatory sequences and regulatory functions [9] . 
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The resulting picture is quite diverse: Core parts of developmental regulatory networks can be con- 
served over large evolutionary ranges [10], and individual promoters in flies are found to be conserved in 
function over large evolutionary distances [11,12]. Functional changes in promoters have been identified 
as well, but the relative roles of adaptive and near-neutral evolution remain to be clarified. The sequences 
in regulatory DNA regions evolve under less constraint than their functional output. This feature can be 
explained by wide-spread compensatory changes, which have been identified between different nucleotides 
within individual binding sites as well as between different sites within a promoter [11-17]. At the pro- 
moter level, this dynamics includes loss and gain of binding sites, the rates of which have been estimated 
in flies and yeast [13, 18, 19]. The observed site turnover is consistent with moderate negative selection 
acting on individual sites [13,20], whereas the function of entire promoters is under stronger stabilizing 
selection [11]. 

The evolutionary constraint of regulatory sequence and function depends on the level of complexity 
in promoter architecture. Prokaryotes and unicellular eukaryotes have short intergenic regions, and 
regulatory functions are often encoded by few binding sites. The more complex cis-regulatory information 
in higher eukaryotes is organized into regulatory modules, which are typically a few hundred base pairs 
long and are spatially separated by larger segments of intergenic DNA [21,22]. Within modules, regulatory 
functions often depend on clusters of neighboring binding sites for multiple transcription factors, which are 
coupled by cooperative interaction [23-27]. Bioinformatic algorithms trace such site clusters to identify 
regulatory DNA regions [28-34] . The relative order and spacing of sites within clusters follows a regulatory 
"grammar" , which distinguishes functionally neutral site changes from rearrangements affecting promoter 
function [17,35-39]. The combinatorial complexity of this grammar ensures the specificity of regulation in 
the larger genomes of multicellular eukaryotes [40,41]. At the same time, the grammar is flexible enough 
to allow substantial sequence evolution in a regulatory module while maintaining its overall functional 
output. 

In addition to point mutations, sequence insertions and deletions (indcls) play a significant role in this 
dynamics. Several studies have noted the prevalence of repetitive sequence elements in promoter regions 
and their potential influence on regulatory function [42-49]. In particular, a recent detailed analysis of the 
evolutionary rates of short tandem repeats in Drosophila has shown a net surplus of insertions, suggesting 
that these repeats may produce new regulatory sequence [48]. But to what extent is this actually the 
case? A priori, the link between repeat evolution and regulation is far from obvious: Duplications in 
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repeats can either be part of the neutral background evolution in regulatory sequences, or increase the 
spacing between existing binding sites of a regulatory module, or contribute to the formation of new 
sites. Disentangling these roles is subtle, because detected tandem repeats in contemporary sequence 
overlap with only a small fraction of binding sites, motif size and total length of most repeats are shorter 
than length and spacing of typical binding sites in a cluster, and repeat lifetimes are much shorter than 
conservation times of regulatory elements [49] . Hence, the role of repeat dynamics for regulation is an 
open problem: Do local duplications actually transport and produce regulatory information? 

This is the topic of the present paper. We show that local duplications have left a striking signature 
in the fly genome: the majority of transcription factor binding sites in regulatory modules show evidence 
of a duplication event in their evolutionary history. We conclude that over long evolutionary times, local 
duplications are pervasive and crucial for the formation of complex regulatory modules in the fly genome. 
This mode of evolution sets the speed of regulatory evolution and facilitates adaptive changes of promoter 
function. We infer site duplications from their traces in the sequence of neighboring binding sites, but 
most duplication events predate the tandem repeats present in contemporary sequence. This distinguishes 
our study from comparative analysis of regulatory sequence between closely related species [45-49] , which 
can detect the insertion-deletion dynamics of contemporary repeats, but cover only a small window in 
the evolution of regulatory sites. 

The importance of binding site evolution by duplication is grounded in the biophysics of transcription 
factor-DNA interactions: the sequence-dependent probability of binding between factor and site depends 
in a strongly nonlinear way on the binding energy [3] : it takes values close to 1 in an energy range below 
the maximum binding energy, then drops rapidly as the energy decreases further, and is close to in the 
energy range of non-binding sites. This nonlinearity generates strong epistatic effects for point mutations 
within binding sites [13, 50] and, in turn, an asymmetry in the turnover of binding sites. Functional 
sites can rapidly lose their binding affinity to a factor by one or two point mutations. Rapid adaptive 
formation of a site, however, requires a seed sequence with marginal binding, to which positive selection 
for point mutations towards stronger binding can latch on. Such seeds are contained in random sequence, 
but at unspecific positions. Estimates of the rate of site formation based on biophysically grounded 
fitness models suggest that point mutations alone can explain the rapid formation of an individual site 
in a sufficiently large sequence interval, but not the formation of spatially confined agglomerations of 
sites characteristic of regulatory modules [50-52]. As we show in this paper, local sequence duplications 
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generate seeds for new sites specifically in the neighborhood of functional sites. 

Our analysis proceeds in three steps. First, we analyze local sequence similarities in regulatory regions 
of the Drosophila melanogaster genome in a model-independent way. In regulatory modules, we find a 
significant autocorrelation in nucleotide content for distances up to about 70 bp. This autocorrelation 
includes the known contributions of tandem repeat sequences, but it extends to a much larger distance 
range. The signal turns out to be generated by local sequence clusters, a substantial fraction of which are 
functional transcription factor binding sites with similar sequence motifs. In the second part of the paper, 
we infer the evolutionary origin of these correlated pairs of binding sites, using a probabilistic model with 
mutations, genetic drift, and selection. The model compares the likelihood of two alternative histories: 
a pair of sites evolves either independently or by duplication from a common ancestor sequence. The 
duplication is followed by diversification under selection for binding of two (in general different) factors. 
We show that the duplication pathway is the most likely history for pairs of sites with a mutual distance 
up to about 50 bp, and we find evidence that this pathway is specific to regulatory modules of multicellular 
eukaryotes. Finally, we show that the duplication mode has adaptive potential: duplicated ancestor sites 
can act as seeds for the subsequent formation of a novel binding site for the same factor and, notably, 
even for a different factor. 

Results 

Sequence autocorrelation in regulatory DNA 

The most straightforward measure of local similarity in a sequence segment is the autocorrelation function, 
which is defined as the difference between the likelihood c(r) that two nucleotides at a distance of r base 
pairs are identical and mean identity cq of two random nucleotides, A(r) = c(r) — cq. This function 
is straightforward to evaluate from sequence data as given by eq. (2) in Materials and Methods. We 
have obtained the autocorrelation function in 346 regulatory modules of the D. melanogaster genome 
with length of more than 1000 bp identified by REDfly database [53-55]. The results are shown in 
Fig. I (a). In the distance range up to about 70 bp, the function A(r) takes positive values that decay 
with r in a roughly exponential way; this signal is clearly above the noise level. The mean identity is 
evaluated in a local window of 500 bp (changing the window length affects the baseline of this function, 
but not its short-distance behavior). The autocorrelation signal is small and has several potential sources, 
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such as multiple binding sites for similar motifs, microsatellite and minisatellite repeats at short length 
scales [46-49] , homopolymeric stretches of nucleotides characteristic of nucleosome-depleted regions [56] , 
or other local inhomogeneities in sequence composition. As a next step, we will characterize local sequence 
similarity in a more specific way: we will show that mutually correlated nucleotide pairs are not evenly 
distributed over regulatory modules, but occur in local clusters with a characteristic length scale of around 
7 bp. This signal will be analyzed from an evolutionary point of view and be linked to cis-regulatory 
function. 

Sequence motifs and information 

To motivate the following analysis, assume that a given sequence segment is covered by families of sites 
belonging to different motifs. By definition, a motif is a probability distribution Q(a) of genotypes 
a = (on, . . . , a?), which describes a specific set of sequence sites with t consecutive base pairs and is 
different from the background distribution Po(sl). The statistical deviation of a motif from background 
is measured by the relative entropy between these distributions, H(Q\Pq), which is given by eq. (4) in 
Materials and Methods. This quantity determines the average sequence information per site, which is 
often quoted in units of bits [4]. Multiplying H(Q\P ) with the number of sites for each motif and 
summing over all motifs produces a measure of the total sequence information contained in a genomic 
region. 

Well-known motifs in regulatory DNA are the families of binding sites for a given transcription factor. 
In eukaryotic systems, these sites have a typical length of about 5-10 bp and frequency distributions Q 
(called position weight matrices) with a typical information content H rts 6 — 8 bits per site; see the 
recent discussion by [57]. Other motifs can be defined, for example, in nucleosome-depleted sequences 
in eukaryotes and for repeat units in tandem repeats. If all motifs occurring in a given sequence segment 
were known, we could try to predict their sites and evaluate the information content directly. In the 
present part of the analysis, we proceed differently. We only assume that sequence motifs carry a certain 
information content over sites of a given length £, but we make no further assumptions on position 
weight matrices, sequence coverage, or evolutionary origin. We can still recover part of this sequence 
information from those motifs that occur more than once in the sequence segment. A pair of sites of 
length I belonging to the same motif has an average similarity information given by the mutual entropy 
K(c, £\cq), which measures the enhanced similarity c of aligned nucleotides of the site sequences compared 
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to the background similarity Co and is given by eq. (5) in Materials and Methods. Clearly, the similarity 
information between pairs of sites is a somewhat diluted measure of the full information content due 
to motifs. As a rule of thumb, the mutual entropy per site pair, K(c,£\co), recovers about half of the 
sequence information per site, H(Q\P ). For example, binding sites for the same transcription factor are 
strongly correlated, with a typical similarity c w 0.7 and a similarity information K « 3 bits per site pair. 

Here, we want to identify pairs of similar sites at a given distance r and relate them to the sequence 
autocorrelation function A(r) discussed above. Thus, we estimate the total similarity information /Q(r) 
per unit sequence length of all strongly correlated pairs of sites with distance r and length I in regulatory 
modules. This quantity can be defined by constructing a set of site pairs for given r and £ with the 
following properties: (i) Any pair of sites has an average mutual similarity between aligned nucleotides 
above a certain threshold, c > c m i n (£). (ii) The left sites (and, hence, also the right sites) of all pairs 
have no mutual overlaps. This condition is necessary in order to avoid overcounting of mutual similarity 
in overlapping site pairs, (iii) The sum of the mutual similarities of all pairs in the set is maximal 
(see Fig. SI for illustration). This condition is also used to set the similarity threshold c m i D (£). To 
identify a set of site pairs with properties (i) to (iii), we use a dynamic programming algorithm as 
explained in Materials and Methods. This method allows for optimization of sequence length I similar 
to the procedure in local sequence alignment algorithms [58]. In the maximum-similarity set, we record 
the average mutual similarity c(r,£) of aligned nucleotides in site pairs, which determines the mean 
information content per site pair, K(c(r,£) 7 £\c ) (see eq. (5) in Materials and Methods). We also record 
the number n(r,£) of site pairs and determine the excess An(r,£) — n(r,£) — n a (£) over the number 
expected by chance in background sequence, n (£) (see Materials and Methods). The distance-dependent 
total similarity information per unit length in a sequence segment of size L can then be estimated as 
)C e (r) = (An(r,£)/L)K(c(r,£),£\c ). 

Our inference of ICi(r) is related to recent methods for prediction of unknown regulatory modules 
based on their enhanced sequence similarity contained in words of length £ [32-34]. But the evaluation 
of sequence similarity and the goals of the analysis differ: module prediction uses the total similarity in a 
genomic region, which in our setup is given by summation of /Q(r) over all distances r and over different 
word lengths £. Our analysis is limited to known regulatory modules and focuses on the dependence of 
/Q(r) on r and £. A specific part of this signal, obtained from sites with distance r below 50 bp, will be 
associated below with local duplications as prevalent evolutionary mode. 
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Similarity information in regulatory modules of Drosophila 

We evaluate the similarity information in the set of 346 regulatory modules of Drosophila melanogaster 
and in surrounding background sequence. The following features of local sequence similarity can be 
extracted: 

— The total information of local sequence similarity is maximal for motifs of length 1 — 7. Fig. 1(b) 
shows the total similarity information of all detected site pairs in the range of up to 100 bp, JC to t{£) — 
2™ /Q(r), as a function of the site length I. The function /C to t(^) takes its maximum, that is, the 
similarity information is most significant, lot 1 = 7. The signal falls off at shorter length scales, because 
typical motif sequences are only partially covered, and at larger length scales, because uncorrelated 
flanking nucleotides contribute negatively to the similarity information. In this sense, detected motifs 
cover a characteristic length of about 7 bp. A similar length scale has been observed in tandem repeats [45- 
47]. 

— The function K.j(r) takes distance- dependent positive values in the range of up to 50 bp and saturates 
to a positive asymptotic value for larger distances. Thus, its distance dependence is compatible to that 
of the sequence autocorrelation function A(r) shown in Fig. 1(a). This pattern is due to site pairs with 
high mutual similarity, c > 0.85. 

— Correlated binding sites explain a substantial part of the similarity information. We estimate this 
contribution by masking all functional sites [53-55] and re-evaluating the function K,i{r) in their sequence 
complement; see Fig. 1(c). Known binding sites cover about 10% of the regulatory modules, but the signal 
is reduced by about 50%, indicating that these sites are an important source of similarity information. The 
binding site-masked signal is comparable to its counterpart JCy(r) in non- regulatory intergenic sequence. 
— Micro satellite repeats explain only a small part of the similarity information. We identify such repeats 
using the Tandem Repeat Finder [59]. If we remove about 5 % of the sequence in regulatory modules 
as repeats, the similarity information is reduced by less than 10%; Fig. 1(c). This is not surprising, 
because our sequence similarity measure differs from that of repeat analysis. In particular, our measure 
is sensitive to correlated segments on larger distance scales than typical tandem repeats, because it does 
not require a contiguous interval of self-similar sequence in between. 

— Homologous regions in other fly genomes show a consistent form of K.y{r). We analyze homologous 
regions of two other Drosophila species, D. yakuba and D. pseudoobscura (see Materials and Methods). As 
shown in Fig. S2, these putative regulatory modules have patterns 1C-j{r) of very similar overall amplitude 
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and distance-dependence, with enhanced values in the range of up to 50 bp. 

In summary, our model-independent analysis shows that motifs with a characteristic length of about 
7 bp play an important part in the distance-dependent sequence autocorrelation of Drosophila regulatory 
modules. The characteristic length coincides with the typical length of binding sites, and a substantial 
fraction of the signal can be explained by sequence correlations involving known binding sites. Therefore, 
we now focus the analysis on this smaller, but experimentally validated set of sites [53-55] and analyze 
in detail the evolutionary mechanism generating the sequences of neighboring pairs of sites. 

Evolutionary modes of binding sites 

Binding sites are ideal objects to study the production of information by sequence evolution. The sequence 
motif is approximately known for about 70 transcription factors in Drosophila, that is, we can analyze 
the full position-dependent sequence information of these motifs, not just the similarity information of 
motif pairs. Furthermore, there is a simple link between sequence statistics and evolution of binding 
sites: assuming the sequence distribution Q defines a motif at evolutionary equilibrium, its sequence 
information H is proportional to the average fitness effect of its binding sites, H(Q\P ) = N(F), with 
a proportionality constant equal to the effective population size [20,50,51,60]. The fitness contribution 
of a particular binding sequence, F(a), is proportional to its log- likelihood ratio in the distributions Q 
and Pq. The ensemble of these fitness values defines an information-based fitness landscape F for binding 
of a specific transcription factor. These relations between sequence statistics and fitness of binding 
sites quantify our intuition that specific sequences are overrepresented in a motif, because they confer a 
selective advantage over random sequences [4] . If we write the motif distribution Q in the product form 
of a position weight matrix, we obtain an approximate expression for the fitness F(a) in terms of the 
position-specific single- nucleotide frequencies qi (a) in the motif sequence and their counterparts po(a) in 
background sequence: NF(a) = J2i=i l°g[9i( a i)/Po (««)]■ This expression, which is in its simplest form 
already contained in Kimura's U-shaped equilibrium distribution for a two-allcle locus [61], is known as 
Bruno-Halpern model in the context of protein evolution [62] and has been used to infer fitness effects 
of mutations in binding sites [20,50-52,60,63]. Although this additive fitness model neglects fitness 
interactions between nucleotides within binding sites as well as between sites within a regulatory module, 
it is justified for the purpose of this study (see below). 

The fitness landscape F defines the selection coefficient of any change from a state a to a state b 
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of a binding site, AF a b = F(b) — F(a). Together with the effective population size and the mutation 
rates, these selection coefficients determine the evolutionary dynamics of binding sites. In particular, the 
probability G T (b|a) of evolving from an ancestor site a to a descendent site b through a series of point 
substitutions within an evolutionary distance r can be evaluated in an analytical way from the underlying 
substitution matrix [58,64] (see Materials and Methods). 

We can now compare the two modes of binding site evolution introduced above. For any given pair 
of adjacent sites a and b that bind transcription factors A and B, respectively, we want to evaluate 
the likelihood of two different histories of site formation. In the first mode of evolution, the sites are 
assumed to evolve to their present sequence states by point substitutions from independent ancestor 
sequences and under independent selection given by the fitness landscapes Fa and Fb, as illustrated in 
Fig. 2(a). If the selection for binding is assumed to act over a sufficiently long evolutionary time, the 
probability of observing the present sequence states a and b in this independent mode of evolution is 
simply Q J 4(a)Qs(b). This mode of evolution can only result in distance-dependent sequence similarity 
arising from an increased coverage with pairs of adjacent sites with correlated motifs Qa and Qb (evidence 
for this effect will be discussed below). However, it does not generate increased similarity of individual 
pairs of adjacent sites beyond that of their motifs. 

In the second mode of evolution, the sites are assumed to evolve from a common ancestor sequence 
by a local duplication event at a distance r from the present, followed by diversification under selection 
given by separate fitness landscapes Fa and F B - cither the original site is under stationary selection for 
binding factor A and the duplicated site has evolved the new function of binding the B— factor or vice 
versa, as illustrated in Fig. 2(b). In this mode, the present sequences a and b have evolved from their 
last common ancestor c by independent substitution processes with transition probabilities G A (a\c) and 
G T B (b\c). The dynamics results in a joint probability of the form Q r (a, b) = ^ c G , ^(a|c)C] 3 (b|c)Q(c), 
where the distribution of the ancestor sequence is given by Q(c) — [Qa( c ) + Qb( c )]/2 (see Materials and 
Methods). In this mode, distance-dependent sequence similarity arises due to common descent, causing 
the sequences of adjacent sites to be more similar than their motifs Qa and Qb- Importantly, this effect 
is generic and not tied to any functional properties of the transcription factors A and B. Fig. 2(c) shows 
a few examples of enhanced sequence similarity in pairs of adjacent binding sites in regulatory modules 
of D. melanogaster. 

The relative likelihood of common versus independent descent for a specific pair of sites a, b is given by 
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the duplication score 5(a, b) = log[<5 r (a, b)/Q(a)Q(b)]. This score measures the similarity in a gapless 
alignment between the sequences a and b in a specific way: it gauges matches and mismatches depending 
on the weights of aligned nucleotides in their respective binding motifs Qa and Qb- The duplication 
score depends on the parameter r, which we choose by a maximum- likelihood procedure (see Materials 
and Methods) . This parameter describes the expected excess similarity of site pairs related by common 
descent, but it is not a linear clock of divergence time. Simulated evolution of binding site histories 
shows that the maximum-likelihood duplication score reliably distinguishes between site pairs (a, b) with 
common and with independent descent (see Materials and Methods). Below, we use the distribution 
W(S) of duplication scores to infer the mode of evolution prevalent in a given class of site pairs. 

This likelihood analysis goes beyond the inference of the sequence similarity /Q(r) introduced above. 
It can be seen as a decomposition of the distance-dependent similarity between sites into two parts: the 
similarity between their motifs, and the excess similarity of the actual site pairs beyond that of their 
motifs. The first part reflects functional correlations within regulatory modules and is assigned to the 
background model Q(a)Q(b). Only the second part provides evidence for common descent, which is 
gauged by the scoring function S*(a, b). 

Our model scores only the sequence similarity within site pairs and does not incorporate the insertions 
and deletions between the sites after duplication, which determine their relative distance. This is justified, 
because the likely divergence times of most duplicated site pairs are much longer than repeat lifetimes. 
If a site duplicates within a repeat, the relative distance between copies may subsequently undergo 
rapid evolution due to the high indel rates in these regions [46-49]. Given a surplus of insertions over 
deletions in regulatory modules, we expect the relative distance to increase on average [48]. The spacing 
of contemporary sites is then the result of a long-term diffusive insertion/deletion dynamics within the 
repeats active since duplication, most of which have decayed in today's sequence. This leaves the similarity 
of conserved functional sites as the most prominent long-term marker of these dynamics. 

Local sequence duplications in Drosophila 

Using the duplication score S, we have evaluated the sequence similarity of 506 pairs of neighboring 
binding sites in regulatory modules of the Drosophila melanogaster genome. These sites are experimen- 
tally validated and recorded in the REDfly database [53-55] (see Materials and Methods). We infer the 
prevalent mode of evolution as a function of the distance r between sites and obtain the main result of 
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this paper: 

— In fly, binding sites with a distance of up to about 50 bp are more likely to share a common ancestor 
than to have evolved from independent origins. Fig. 3(a) shows the histogram of duplication scores S (a, b) 
for the set of k = 306 binding site pairs with r < 50 bp. The score distribution W(S) of these pairs 
is clearly distinguished from the background distribution Qq(S), which is obtained from pairs of sites 
located in the same module at a distance r > 200 bp and is associated with independent descent. We 
decompose the score distribution of adjacent sites in the form W(S) = (1 — X)Qo(S) + XQ(S), attributing 
the excess of large scores to pairs of sites of common descent with a score distribution Q(S). Our best 
fit of this mixed-descent model to the data distribution has a fraction A = 57% of adjacent site pairs 
formed by duplication; see Fig. 3(a). The total log-likclihood of the mixed-descent model relative to the 
background model is given by multiplying the relative entropy of the distributions W and Qo with the 
number of site pairs, £ = k H{W\Qq). We estimate S > 234, providing significant statistical evidence 
that the prevalent mode in adjacent sites is evolution from common descent (for details, see Materials and 
Methods). We note that this significance emerges for the ensemble of the adjacent site pairs, whereas the 
relative log-likelihood for duplication per site pair, H(W\Qo), is of order one: individual site sequences 
are inevitably too short to reliably discriminate between the two evolutionary modes. Our conclusion 
that local sequence duplications generate the observed excess similarity of adjacent sites is supported by 
a number of further controls and a comparison with the yeast intergenic regulatory sequences: 
— The relative log-likelihood for duplication per site pair decreases with increasing distance r between sites. 
In Fig. 3(b), we evaluate the relative entropy H(W r \Q ) for the score distributions W r (S) of site pairs 
with different values of mutual distance r. We find a rapid decay up to about 100 bp, that is, the score 
distribution W r becomes successively more similar to the background distribution Q with increasing site 
distance. This pronounced distance-dependancc is comparable to that of the total sequence similarity 
shown in Fig. 1(c) and is consistent with local duplications as underlying mechanism. 
— Similarity of neighboring sites is not limited to specific pairs of transcription factors. We partition 
the 306 site pairs with a mutual distance of less than 50 bp by factor pairs and evaluate the partial 
score averages {S)ab- We compare the distribution of these averages with the corresponding distribution 
of averages evaluated after scrambling the score values of the site pairs, as shown in Fig. 3(c). The 
two distributions are statistically indistinguishable, which shows that excess sequence similarity is a 
broad feature of adjacent binding sites and is not limited to a subset of sites for factor pairs with specific 



12 



functional relationships. This supports our conclusion that the excess sequence similarity reflects common 
descent and not fitness interactions (epistasis) between sites. Of course, epistasis is common for binding 
sites in the same regulatory module, because these sites perform a common regulatory function. However, 
generic interactions couple the binding energies of adjacent sites, not directly their sequences. Epistatic 
effects generating excess sequence similarity are conceivable for specific factor pairs, but do not appear 
to be a parsimonious explanation for the broad similarities of adjacent binding sites we observe. 
— In yeast, binding site duplications are not frequent. For comparison, we have also evaluated a set of 1352 
pairs of binding sites in the Saccharomyces cerevisiae genome. Fig. 3(d) shows distribution of duplication 
scores 5(a, b) for the set of binding sites with r < 50 bp. This distribution is strongly peaked around zero 
(because the maximum-likelihood value of r is large, see Materials and Methods) and indistinguishable 
from the distribution of the control set of random site pairs; both distributions have a negative average. 
As in Drosophila, most binding sites in the same intergenic region of S. cerevisiae are located within 50 
bp from each other. However, we do not observe evidence for local duplications as a mode of binding site 
formation in yeast. Clearly, this result does not exclude that such duplications take place, but they do not 
appear to be frequent enough to generate a statistically significant excess similarity of neighboring sites. 
This is not surprising given the differences in regulatory architecture between yeast and fly: individual 
sites in S. cerevisiae are more specific than in Drosophila; the average sequence information of a binding 
motif is H w 12 — 17 bits, compared to H m 6 — 8 bits [57]. Accordingly, a larger part of the regulatory 
functions in yeast relies on single sites, and there are no regulatory modules which would require frequent 
duplications for their formation. 

Adaptive potential of duplications 

Do the inferred site duplications have adaptive potential for the formation of novel binding sites? Here, 
we use the term adaptive potential to indicate that the duplication itself may be a neutral process, and 
selection for factor binding may latch on later to duplicated sites. The duplication of a site for a given 
transcription factor has obvious adaptive potential towards formation of an adjacent site for the same 
factor. But local duplications also have adaptive potential if the duplicated site is to evolve the new 
function of binding a different factor, because the binding motifs of transcription factors with adjacent 
sites are correlated. This correlation quantifies the ability of one factor to recognize the binding sites 
of another factor, including seed sites generated by sequence duplications. Specifically, we define the 
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binding correlation Bab of a transcription factor A with another factor B as the average information- 
based fitness to bind factor B in the ensemble of A-sites. In Fig. 4, this quantity is evaluated for all factor 
pairs (A, B) with adjacent binding sites, together with the range of fitness Fb of known target sites for 
factor B and the average Fb in background sequence (see Materials and Methods) . For most such factor 
pairs, the fitness of a typical A-site is seen to be similar to that of weak B-sites and significantly larger 
than the average fitness of background sequence. This binding correlation between motifs is sufficient so 
that an ^4-site duplicate can act as a seed for a -B-site, which can subsequently adapt its strength by point 
mutations. The binding correlation is specific to factors which have adjacent binding sites; we have found 
no such effect in the control ensemble of all factor pairs (A, B) (most of which do not have adjacent sites). 
Furthermore, some highly specific motifs, such as hunchback, twi and z do not show binding correlations 
with other factors. 

Discussion 

Local sequence duplication as a mechanism of regulatory evolution 

Local sequence duplications (and deletions) are a generic evolutionary characteristic of intergenic DNA 
and, in particular, of regulatory sequence [44-49]. In this study, we have established evidence for local 
sequence duplications as a mechanism that transports and produces cis-regulatory information. These 
duplications generate specific, distance-dependent sequence similarity in strongly correlated pairs of sites 
with a relative distance of up to about 50 bp, which account for a substantial part of the sequence 
autocorrelation in fly regulatory modules. In particular, they provide a parsimonious explanation for the 
observed excess sequence similarity of transcription factor binding sites in this range of relative distance. 
We conclude that the majority of these adjacent site pairs have a common ancestor sequence. The large 
amplitude of the duplication signal may be the most surprising result of this study. It far exceeds the 
level expected from the repeats in contemporary sequence, which cover only about 5 percent of binding 
sites and are typically shorter than the distance between correlated sites. Common-descent site pairs are 
the cumulative effect of past duplications over macro-evolutionary intervals, whose trace is conserved by 
selection on site functionality. 

This result establishes local duplication as a pervasive formation mode of regulatory sequence, which 
generates, for example, the known local variations in site numbers between Drosophila species. Of course, 
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our evidence for this mode is statistical and, at this point, is confined to a limited dataset of binding 
sites with confirmed functionality [53-55]. The duplication mode appears to be specific to multicellular 
eukaryotes; we have not found comparable evidence in the yeast genome. Our findings are relevant for 
genome analysis in two ways: including local duplications should inform inference methods for binding 
sites as well as alignments of regulatory sequence with improved scoring of indels [46-49]. With such 
methods, it may become possible to follow the evolutionary history of binding sites in more detail. 

Life cycle of a binding site 

We have found evidence that local duplications can confer adaptive potential for the formation of novel 
binding sites, because they generate seed sequences with marginal binding specifically in the vicinity of 
existing sites. This mechanism is necessary, because point mutations alone can only lead to rapid loss but 
not to gain of new sites with positional specificity. Thus, duplications and point mutations complement 
each other, suggesting that typical binding sites within multicellular eukaryotes have an asymmetric 
life cycle: formation within a functional cluster by local duplication, adaptation of binding energy by 
point mutations, evolution of relative distance to neighboring sites by insertions and deletions in flanking 
sequence, conservation by stabilizing selection on binding energy, and loss by point mutations. 

The life cycle of individual binding sites interacts with other levels of genome evolution. Gene du- 
plications with subsequent sub-functionalization have been identified as an important evolutionary mode 
specifically in higher eukaryotes [65]. If subfunctionalization is initialized at the level of gene regulation, 
it amounts to a loss of regulatory input for both gene duplicates and provides a mechanism for adaptive 
loss of binding sites. This process alone would lead to genomes with many genes, but few functions per 
gene. Maintaining regulatory complexity with multi-functional genes as observed in eukaryotic genomes 
[23,26] requires a converse evolutionary mode: gain of new functions by existing genes. At the regulatory 
level, this amounts to gain of regulatory input, i.e., adaptive formation of new binding sites. 

Sequence evolution and regulatory grammar 

Previous studies have identified regulatory modules as important units of transcriptional control, in which 
clusters of binding sites bind multiple transcription factors with cooperative interactions. The sites in 
a cluster follow a regulatory grammar resulting from natural selection acting on site order, strength, 
and relative distances [36-38]. If sequence duplications play a major role in the formation of such 
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clusters, we may ask how much of their observed structure reflects this mode of sequence evolution, rather 
than optimization of regulatory function by natural selection. Local duplications generically produce 
descendant sites, which are weak binding sites for another factor at best, as shown in Fig. 4. (Significant 
heterogeneity in binding strength between adjacent sites is indeed observed in our sample.) The resulting 
binding sequences are hardly optimal in terms of specificity and discrimination between different factors. 
Cooperative binding between transcription factors may have evolved as a secondary mechanism to confer 
regulatory function to these sequence structures. 

In this paper, we have argued that local sequence duplications facilitate the adaptive evolution of 
gene regulatory interactions. However, the adaptive potential of duplications does not imply that the 
duplication process itself has to be adaptive or even confined to regulatory sites. Similar to gene duplica- 
tions [65], many site duplications may be neutral and provide a repertoire of marginal regulatory links. 
Adaptive diversification can build subsequently on this repertoire, conserving and tuning those links that 
confer a fitness advantage and discarding others. 

Materials and Methods 

Regulatory sequences and position weight matrices 

The sequence analysis of D. melanogaster is based on the ds-regulatory modules and experimentally 
validated binding sites collected in the REDfly v. 2. 2 database [53-55], and on the position weight matri- 
ces of Dan Pollard's dataset (http://www.danielpollard.com/matrices.html). To measure the distance- 
dependent sequence similarity /C^(r), we use the 346 known regulatory modules with length of more than 
1000 bp in D. melanogaster. The analysis in D. yakuba and D. pseudoobscura is based on the 249 well- 
aligned homologous regions obtained from multiple alignments of 12 Drosophila species (dm3, BDGP 
release5); see Fig. S2. For the evolutionary inference in the second part of the paper, we use only the 
experimentally validated binding sites contained in these modules which are not necessarily selected for 
high similarity to motifs or for high mutual similarity. To avoid biases in our analysis, the set of sites is 
truncated in three ways: (i) We only use binding sites for transcription factors that occur in at least two 
different regulatory modules, so that the position weight matrix is not biased by the sequence context 
of a single module, (ii) We use only sites that have no sequence overlap with other sites in the dataset, 
because our inferred fitness landscapes describe the selection for a single regulatory function [13]. (iii) We 
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exclude sites in the X chromosome, which could bias the results by its high rate of recent gene dupli- 
cations and the abundance of repeat sequences [66,67]. These conditions produce a cleaned set of 506 
transcription factor binding site pairs located in 74 cis-regulatory modules. For the analysis in S. cere- 
visiae, we use sites and position weight matrices from the SwissRegulon database [68]. These footprints 
do not always match the length i of their position weight matrices. To produce a set of site sequences of 
common length £, longer footprints are cut and shorter ones joined with flanking nucleotides, such that 
the binding affinity is maximized. 

Sequence information measures 

Sequence autocorrelation is a measure of enhanced mean similarity between the nucleotides of a sequence 
segment. The distance dependence of the autocorrelation signal provides information about the range, 
within which the nucleotides appearing in the sequence are correlated. In a given sequence segment 
a\ , . . . , at , the nucleotide frequencies are given by 

1 L 

Po(a) = -^2S( ai/ ,a), (1) 

V— 1 

where 5{a Vl a) = 1 if a v = a and 8{a v ,a) = otherwise. These determine the mean similarity between 
two random nucleotides of the segment, c = J2 a Po( a )- The sequence autocorrelation function is then 
defined by 

1 L ~ r 

A{r) = -c + - y)j(a„,a„ +r ). (2) 

L — r z — ' 
i/=i 

We evaluate this function in the 346 regulatory modules of Drosophila melanogaster genome with length 
of more than 1000 bp identified by REDfly v. 2. 2 database [53-55]. As shown in Fig. 1(a), we find an 
approximate exponential decay with a characteristic length of about 100 bp as the range of sequence 
correlation. The mean identity Co is evaluated in a local window of 500 bp (changing the window length 
affects the baseline of this function, but not its dependence on distance up to 100 bp). Information about 
the spatial distribution of these correlated nucleotides along the genome is contained in higher orders of 
sequence autocorrelation (i.e., reoccurrence of doublets, triplets, etc.). Here, we use information theory 
to identify such clusters of correlated nucleotides in a sequence region. 

We want to detect reoccurring nucleotide patterns or motifs. A motif of length I is a probability dis- 
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tribution Q(a) for sites a = {a\, . . . ,ai) which differs significantly from the background distribution 
Po(a). If we neglect correlations between nucleotides, we can write these distributions as the product of 
single-nucleotide frequencies, 

i 

Q(a) = n?*(°*) ( 3 ) 

»=1 

and Po(a) = rii=iPo(«j)- The Ax £ matrix of single-nucleotide frequencies (3) is called the position 
weight matrix of the motif. The sequence information of the motif is defined as the relative entropy 
(Kullback-Leibler distance) of these distributions [69] , 

ff(0|F„)=^ g ,(a)lo g ^|. (4) 

i=l a 

To study the sequence coverage by informative motifs, we use a reduced form of the full frequency 
distribution Q by mapping it to the mean similarity of its motif sites. Hence, even without any prior 
knowledge on frequency distributions, we can recover part of the sequence information for those motifs 
that occur more than once in the sequence segment. Two sites drawn from the motif have a mean similarity 
c = J2i a ( li( a ) between aligned nucleotides, which is higher than the background mean similarity c . The 
similarity information of the motif is given by the relative entropy 



K(c,£\c ) 



c 1 — c 

clog h(l-c) log- 

c 1 — c 



(5) 



Similarity information between pairs of sites is a somewhat diluted measure of sequence information. As 
a rule of thumb, the mutual similarity entropy per site pair, K(c,£\c ), recovers about half of the motif 
information per site, H(Q\P ). 

Inference of similarity information by dynamic programming 

To estimate the total similarity information Kg(r) of all strongly correlated pairs of sites with distance r 
and length £ in a sequence segment of length L, we construct a set 

{(a Vl ,..., a„ 1+ £_i), (a„ 1+r , . . . , a„ 1+r+ £_i)}, . . . , {{a Vn , . . . , a Vn+ e-i), {a Vn+r , . . . , a Un+r+ ^i)} (6) 
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of site pairs with the following properties: 

(i) The left (and also the right) sites of all pairs have no sequence overlap, 

v a +i - v a >l fora=l,...,n-l. (7) 

(ii) The mean similarity of each pair is greater than a threshold c m ; n , 

1 1 

c a = - f S(a Va+i , a Va+r+i ) > c min for a = 1, . . . ,n. (8) 
1 i=i 

(iii) The sum of mutual similarities X^«=i c a is maximal (see Fig. SI) 
By the dynamic programming recursion 

1 e 

C t = max[C t _i, C t ~i + [- 5(a t _ e+i ^ r , a t ~e+i)} - c min ], (9) 
1 i=i 

we obtain the sequence of partial scores C\,...,Cl with the initial condition C\ — 0. We then use 
a backtracking procedure (see, e.g., [58]) to determine the set of positions (v\, . . . ,v n ) and, hence, the 
number n(r,£,c m ; n ) and the average similarity c(r,i?,c m in) = [C^jn) + c m ; n of the high-similarity pairs 
(6). To estimate the expected number of pairs in background sequence, no(r, £, c m ; n ), we apply the same 
procedure to 1000 sequences of length L, which are generated by a first-order Markov model 

L 

P(oi, ...,a L )= po(«i) J] T KK-i) ( 10 ) 

with the same single-nucleotide frequencies Po(a) and conditional frequencies T(a\b) as in the actual 
sequence. We then evaluate the excess An(r, I, c min ) = n(r, £, c m ; n ) — n (r, £, c min ) and obtain an estimate 
of the total information contained in the enhanced autocorrelation of motifs as given by eq. (5), 



fCi(r) = Imax 



An r,l,c miri / c(r,£,c min ) 1 - c(r, £, c min ) 

log + log = 

c 1 - c 



(11) 



We infer c m i n by maximum likelihood analysis of the total similarity information in the sequence. This 
method also allows for optimization of the motif length £, similar to the procedure in the local sequence 
alignment algorithms [58]. 
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Evolutionary model for binding sites 

Our evolutionary dynamics of binding site sequences a = (a\, . . . ,a^) for a given transcription factor 
is determined by the Bruno-Halpcrn fitness model [62] derived from the position weight matrix qi(a) 
(i = 1, . . . ,£) and the background frequencies Po(a), 

NF( a ) = J2^ with / i (a)= log ^T. (12) 

This relationship between fitness and nucleotide frequencies is valid if binding sites are at evolutionary 
equilibrium under mutations, genetic drift, and selection, and background sequence is at neutral equilib- 
rium (accordingly, all inferred fitness values are scaled in units of the effective population size N). The 
relationship of the evolutionary ensembles with the underlying thermodynamics of site-factor interactions 
is discussed, for example, in ref. [52]. Eq. 12 defines an information-based fitness model: the average 
fitness of functional binding sites equals the sequence information of the motif, 



(F) = H(Q\P ) (13) 

with Q(a) = Qi( a i) an d -Po( a ) — lli=iPo( a i); scc cc l s - (3) and (4). We infer Po(a) from the 

local background frequency of the region 500 base pairs around each binding site. The rates u a ^b 
of point substitutions a — > b within binding sites are determined by the scaled selection coefficients 
NAF a b — N[F(h) — F(a)] derived from this fitness model and the point mutation rates /U a -^b (which are 
assigned a uniform value \x for simplicity). Here, we use the standard Kimura-Ohta substitution rates 

NAF ah 

^b = ^b 1 _ exp( _ 7VAFab) , (14) 

which are valid in the regime \iN <C 1 (in which subsequent substitution processes are unlikely to overlap 
in time) and AF a b <C 1 [61,70]. The matrix of these substitution rates then determines the transition 
probabilities (propagators) G r (b|a) from an arbitrary initial sequence a to an arbitrary final sequence 
b within an evolutionary distance r [58,64]. Given the set of transition probabilities, we obtain the 
joint probability Q T (a, b) for a pair of sites (a, b) that bind transcription factors A and B, respectively, 
and have evolved from a common ancestor c as described in the main text and in Fig. 2(b). First, we 
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assume that the ancestor site is at evolutionary equilibrium under selection to bind factor A, that is, the 
contemporary site a has the ancestral function and b has evolved a new function after duplication. This 
gives the contribution 

Qi(a,b) = ^G^(a|c)G^(b|c)Q A (c) 

C 

= ^G^(b|c)G^(c|a)Q A (a), (15) 

c 

where we have used the detailed balance condition of the substitution dynamics [64]. There is a second 
contribution Q T B (b, a) describing the case of the ancestor c under stationary selection to bind factor B. 
Weighing these cases with equal prior probabilities, we obtain 

Q r (a,b) = ^[g^(a,b) + Q^(b,a)]. (16) 

In our analysis of pairs of adjacent binding sites in Drosophila, there is usually a dominant contribution 
from one of the terms, from which we can infer the likely function of the ancestor site. In the limit of 
large t, the evolution from a common ancestor becomes indistinguishable from evolution by independent 
descent, limbec Q T (a, b) = Q A (a)Q B {h). 

Inference of common descent 

The duplication score 

s(a ' b, = '° g dTOT (I7> 

is a measure of sequence similarity between binding sites. This score depends on the evolutionary dis- 
tance parameter r. We infer the optimal value of r by maximizing the likelihood ratio between the score 
distribution of pairs with mutual distance r < 50 and the score distribution of pairs with independent 
origin. In D. melanogaster, we find a finite maximum-likelihood evolutionary distance r w 0.4^ _1 and 
significantly positive values of the duplication score for adjacent binding sites. In S. cerevisiae, we find 
large values r 3> i- e -> there is no statistical evidence for evolution by common descent. Our conclu- 
sions are largely independent of the values of r used in (16) and (17). These values should be regarded as 
model fit parameters for the observed sequence similarities. Energy-based fitness models [13,64], which 
take into account the epistasis between mutations within binding sites, are required to obtain more ac- 
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curate estimates of r, which can be tested against phylogenetic data. Epistasis will increase the inferred 
values of r compared to the additive (Bruno-Halpern) model [13,64]. 

We evaluate the score distribution W(S) of a given class of site pairs in terms of a mixture model of 
common and independent descent, 



W{S) = (l-X)Qo{S) + XQ(S). (18) 

The distribution of scores for independent descent, Qo(S), is obtained from pairs of sites in a common 
module with a relative distance r > 200 bp (Fig. 3(a), dashed line). This distribution is approximately 
Gaussian and has a width of order one, which is consistent with the simulations reported below. Because 
we build Q from sites in a common module, its score average is above that for pairs of sites located 
in different modules. In this way, the overall sequence similarity within modules, which depends on the 
local GC-content, is assigned to the background model and does not confound the evidence for common 
descent. The distribution Q(S) is the best fit to the the large-score excess of the distribution W(S) for 
adjacent sites with a relative distance r < 50 bp (Fig. 3(a), violet-shaded). This distribution has larger 
mean and is broader than the background distribution Q , which is also consistent with the simulations 
reported below. 

Given a set of k site pairs (a, b) with scores 5(a, b) described by the distribution W(S), the log- 
likelihood of the mixed-descent model (18) relative to the independent-descent background model is 
given by 



site pairs 



site pairs 



0(S(a,b)) 



Q (5(a,b)) 



(19) 



it equals the product of the number of sites and the relative entropy H(W\Qo). The extensive quantity E 
measures the statistical evidence for the mixture model based on the number and the score distribution 
of site pairs, whereas H(W\Qo) quantifies only the shape differences between the distributions W(S) and 
Qo{S). We evaluate eq. (19) using the conservative estimate Q(S)/Qo(S) > exp(5 — So) with So = 0.7; 
see Fig. 3(a). 

We have tested our inference procedure by simulations of the sequence evolution for pairs of binding 
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sites with common and with independent descent. For these simulations, we use four pairs of differ- 
ent factors {A, B} — {ftz, bed}, {ftz, abd — A}, {bed, abd — A}, {bed, Kr}, and two pairs of equal factors 
{A, B} = {ftz, ftz}, {bed, bed}. For each factor pair, we obtain an ensemble of 25000 pairs of binding 
sites (a, b) with a duplication in their evolutionary histories, as described by Eqs. (15, 16) and Fig. 2(b). 
We first obtain 500 duplication events (c, r): the last common ancestor sequence c is drawn with equal 
likelihood from the ensemble Qa(c) or Qb(c), and the divergence time r is drawn from an exponential 
distribution with mean f = 0.4/ /i. For each duplication event, we draw 50 site pairs (a, b) from the distri- 
bution Q\(a\c)Q B (b\c) describing evolution under selection for binding of factors A and B, respectively. 
We then apply our scoring procedure to this set of site pairs. As for the real sequence data, we infer a 
single maximum-likelihood parameter tml by maximization of the total duplication score X. As shown in 
Fig. S3(a), X has a pronounced maximum at a value tml ~ 0.3//i, which is close to the mean divergence 
time f of the input data. We conclude that the constraint of a fixed r does not confound the inference 
of common descent. We also obtain separate score distributions for sites (a, b) binding the pairs (A, B) 
of equal factors and of different factors listed above; see Fig. S3(b) and Fig. S3(c). These distributions 
are similar and clearly distinguish duplicated site pairs from pairs with independent ancestries for both 
factor groups. We conclude that our method can infer common descent of binding sites, independently 
of their functional characteristics. 

Binding correlation of transcription factors 

We define the binding correlation Hab for each ordered pair of factors (A, B) as the average information- 
based fitness of A-sites for the -B-factor, 



This value is an estimate for the compatibility of the A-sites with the transcription factor B and equals, 
up to a constant, the information-theoretic cross entropy between the distributions Qa and Qb- In Fig. 4, 
this quantity is compared to (i) the sequence information H B of the motif Q B , which equals the average 
fitness of -B-sites for the B-factor by eq. (4) , 




qB,i( a ) 

Po(a) 



(20) 




(21) 



,a 
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and (ii) to the average fitness of background sequence for the £>-factor, 

H 0B = Y,Mb)fB,i(b)- (22) 

i,b 
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Figure 1. Sequence similarity in regulatory modules of the fly genome, (a) Sequence 
autocorrelation A(r) as a function of distance r, obtained from 346 regulatory modules in D. melanogaster 
(gray: unbinned data, blue: binned in intervals of variable length). The autocorrelation values are positive and 
depend on r in a roughly exponential way up to about 70 bp. (b) Total similarity information 
/Ctot(^) = X^r=°i fci{ r ) as a function of motif length I for all pairs of strongly correlated sites with mutual 
distance r < 100 bp in the same set of regulatory modules. This function takes its maximum at a characteristic 
motif length of I = 7 bp. (c) Distance-dependent similarity information K,7(r) for motif length 1 = 7 evaluated 
in all sequence (red), binding site-masked sequence (green), repeat-masked sequence (blue) in regulatory 
modules, and in generic intergenic sequence (black). Repeat-masked sequence is generated using the Tandem 
Repeat Finder [59] with match-mismatch-indel penalty parameters (2,3,5). Insert: Total similarity information 
/Ctot(^=7) for the same sequence categories. Binding sites, but not tandem repeats, account for a substantial 
fraction of the similarity information. 
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Figure 2. Evolutionary modes of transcription factor binding sites. The figure shows alternative 
formation histories for two adjacent binding sites, whose present sequences bind transcription factors A and B, 
respectively. The color coding indicates the evolution of binding function for factor A (red) and B (blue) with 
evolutionary time t. (a) Evolution from independent ancestor sequences. The sites evolve to their present states 
by independent evolutionary processes under stationary selection given by different fitness landscapes Fa and 
Fb (see text). In this mode, adjacent sites will show no enhanced average sequence similarity compared to the 
similarity of their motifs, (b) Evolution by duplication of a common ancestor sequence. Left panel: The original 
site evolves in the stationary fitness landscape Fa- At a distance r from the present, this site undergoes a 
duplication. The duplicated site evolves its new function of binding B in the fitness landscape Fb- Right panel: 
The same process with the roles of A and B interchanged. In the duplication mode, the sites retain an enhanced 
sequence similarity, which reflects their common descent, (c) Examples of adjacent functional binding sites with 
enhanced sequence similarity in the D. melanogaster genome. The sites of each pair are aligned. The color 
background of nucleotide a at position i indicates its contributions to fitness (binding affinity) for factor A and 
B, i.e., fi,A{a) (level of red) and /i,s(a) (level of blue). The sequence similarity leads to hybrid binding 
characteristics: some nucleotides of the ^4-site (top row) have binding characteristics of the B-motif, and vice 
versa. Examples from top to bottom (factor A / factor B, genomic positions, duplication score): (i) Kruppel / 
hunchback, chr3L: 8639822 / 8639878, S = 4.40, (ii) zeste / Trithorax-like, chr3R: 12560236 / 12560218, 
S = 3.97, (iii) Kruppel / tailless, chr3L: 8639586 / 8639596, S = 3.40, (iv) pangolin / apterous, chr3R: 22997722 
/ 22997752, S = 2.38. 
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Figure 3. Common vs. independent descent of binding sites in fly and yeast, (a) Histogram of the 
duplication score S for 306 pairs of binding sites with a mutual distance of up to 50 bp in the genome of 
D. melanogaster (sum of grey-shaded and violet-shaded part). Decomposition of counts according to the 
mixed-descent model (see Materials and Methods): 43% of the site pairs are of independent descent and have 
the score distribution Qo(S) (obtained from pairs with relative distance r > 200 bp, dashed line), 57% of the 
site pairs of are of common descent and have the score distribution Q(S) (violet-shaded), (b) Relative 
log-likelihood for duplication per site pair, i.e., relative entropy H(W r \Qo) obtained from the score distribution 
W r (S) of site pairs in the relative distance range (r, r + 15) bp (evaluated from a total of 506 sites). The rapid 
decay of this function suggests a local mechanism generating excess similarity between adjacent sites, 
(c) Histogram of partial score averages {S)ab for all factor pairs (A,B) binding the site pairs of (a) 
(grey-shaded) and corresponding distribution of averages obtained after scrambling the score values of site pairs 
(normalized to the same number of total counts, dashed line). The two distributions are statistically 
indistinguishable (KS-test p-value = 0.8378), which shows that positive duplication scores are not limited to a 
subset of factor pairs, (d) Histogram of the duplication score S for 833 pairs of binding sites with a mutual 
distance of up to 50 bp in the genome of S. cerevisiae (grey-shaded). The distribution is not significantly 
different from the null distribution obtained from random site pairs (normalized to the same number of total 
counts, dashed line), i.e., there is no evidence for common descent as prevalent evolutionary mode. 
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Figure 4. Adaptive potential of binding site duplications. The binding correlation Hab of all pairs of 
Drosophtla transcription factors (A, B) which have adjacent binding sites in a common regulatory module is 
evaluated as the average information-based fitness of A-sites for factor B and plotted against the sequence 
information Hb of the binding motif of factor B (blue crosses); see eqs. (20) and (21) in Materials and Methods. 
The binding correlation is compared to the distribution of fitness values Fb of the B-sites (red dots, the average 
fitness for each factor is shown as diamond and equals the abscissa Hb) and to the average fitness Fb in 
background sequence (green dots); see eq. (22) in Materials and Methods. The binding correlation Hab is 
significantly larger than the background average of Fb and is comparable to the fitness Fb of weak _B-sites in a 
substantial fraction of cases. Some highly specific motifs, such as hunchback, twi and z do not show binding 
correlations with other factors. 



^ / ^<TGCTGCAGTAAACG TGCGGC^r.A GTAACTGATAATfic Gr^ 

TGgTGCAGrA AACGTGCGGCAATA GrAACrGflrA ATAcGTAACTGCTAccATCTflCTCTArrGrACArrccTRATCGTACGTI 



Figure SI. Motif detection in sequence segments (schematic). The figure shows a configuration of 
correlated sequence sites of length I — 10 bp and distance r = 14 bp from each other. Pairs of correlated sites 
have the following properties: (i) The average mutual similarity between aligned nucleotides is larger than a 
given threshold, c > c m i n = 0.8. (ii) The left sites (and, hence, also the right sites) of all pairs have no common 
nucleotides. This condition is necessary in order to avoid overcounting of mutual similarity in overlapping site 
pairs, (iii) The sum of the mutual similarities of all pairs in the set is maximal. In the example shown, there are 
three different motifs with reoccurring sequence patterns marked by different colors (red, blue, green). To 
illustrate the alignment of the site pairs, we shift the whole sequence by r — 14 bp in the second row. The left 
and right site of each motif are shown in boldface in the first and the second row, respectively. Mismatches 
between aligned sites of the same motif are shown in boldface gray letters. The flanking regions separating the 
correlated sequence pairs are shown in smaller font. 



34 



x 10" 




10 




x 10- 



10 



100 150 200 

r (bp) 




100 150 200 

r (bp) 



Figure S2. Sequence similarity in regulatory modules of 3 Drosophila species. Distance-dependent 
similarity information Kr(r) for motif length £ — 7 in regulatory modules (red) and in generic intergenic 
sequence (black), evaluated in D. melangaster and in the homologous regions of D. yakuba and D. pseudoobscura 
(see Materials and Methods). These data show a consistent pattern of overall amplitudes and of decay lengths. 
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Figure S3. Tests of the duplication inference method. We simulate binding site pairs (a, b) evolving 
by common descent or by independent descent, as described in Materials and Methods, (a) Dependence of the 
total duplication score E on the time parameter r for an ensemble of 150000 site pairs of common descent. This 
function has a pronounced maximum at a value tml ~ 0.3/ fi, which is close to the mean divergence time 
r = 0.4/n since duplication, (b) Distributions of the score S (with r = tml) for pairs of sites binding different 
factors. The distribution for sites of common descent (filled curve) is distinguished from the distribution for 
sites with independent descent (solid curve) by its increased score average, (S) — {S)o = 2.1, and by its increased 
width, (c) Same as (b) for pairs of sites binding the same factor. The distribution for sites of common descent 
(filled curve) has again an increased average, (S) — (S)o = 1.6, and an increased width. 



